Data Visualization


There is a seemingly infinite number of tools for data visualization in Python. For today, we're going to focus on Matplotlib and Seaborn.

Matplotlib is the standard 2D plotting library for Python (https://matplotlib.org/)
Seaborn is a statistical data visualization library built atop Matplotlib (https://seaborn.pydata.org/)

We'll also delve into geographic plotting using geopandas (http://geopandas.org/) and bokeh (https://bokeh.pydata.org/en/latest/index.html).


In [1]:
# rendering our plots inline (aka, in our Jupyter notebook) and changing the layout a bit

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [2]:
# importing all of our libraries

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
# setting some more styling

sns.set_style("whitegrid")
sns.set(rc={'figure.figsize': (20, 20)})
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks'])

We'll pick up where we left off last class using our NYPD Crash Data.

In [4]:
data = pd.read_csv("./NYPD_Crashes.csv",low_memory=False)
In [5]:
data.dtypes
Out[5]:
DATE                              object
TIME                              object
BOROUGH                           object
ZIP CODE                          object
LATITUDE                         float64
LONGITUDE                        float64
LOCATION                          object
ON STREET NAME                    object
CROSS STREET NAME                 object
OFF STREET NAME                   object
NUMBER OF PERSONS INJURED        float64
NUMBER OF PERSONS KILLED         float64
NUMBER OF PEDESTRIANS INJURED      int64
NUMBER OF PEDESTRIANS KILLED       int64
NUMBER OF CYCLIST INJURED          int64
NUMBER OF CYCLIST KILLED           int64
NUMBER OF MOTORIST INJURED         int64
NUMBER OF MOTORIST KILLED          int64
CONTRIBUTING FACTOR VEHICLE 1     object
CONTRIBUTING FACTOR VEHICLE 2     object
CONTRIBUTING FACTOR VEHICLE 3     object
CONTRIBUTING FACTOR VEHICLE 4     object
CONTRIBUTING FACTOR VEHICLE 5     object
COLLISION_ID                       int64
VEHICLE TYPE CODE 1               object
VEHICLE TYPE CODE 2               object
VEHICLE TYPE CODE 3               object
VEHICLE TYPE CODE 4               object
VEHICLE TYPE CODE 5               object
dtype: object

Again, we need to take a moment and convert some of our dtypes:

In [6]:
data['DATETIME'] = data.DATE + ' ' + data.TIME # create a new field called 'datetime' that combines date and time
data.DATETIME = pd.to_datetime(data.DATETIME, format="%m/%d/%Y %H:%M") # format this new column as a datetime
In [7]:
data.TIME = pd.to_datetime(data.TIME, format="%H:%M")
In [8]:
data.DATE = pd.to_datetime(data.DATE, format="%m/%d/%Y")
In [9]:
# we'll also create two new columns, 'injury' and 'death' 

data['INJURY'] = (data['NUMBER OF PERSONS INJURED']>0) # true if there's at least one injury, false if otherwise
data['DEATH'] = (data['NUMBER OF PERSONS KILLED']>0) # true if there's at least one death, false if otherwise

Overplotting

When every row is drawn as a full-opacity marker, dense areas collapse into a solid blob and the structure of the data disappears. This is overplotting.

In [10]:
data.plot(kind='scatter', x='LONGITUDE', y='LATITUDE')
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a23353390>
In [11]:
clean_mask = (data.LATITUDE > 40) & (data.LATITUDE < 41) & (data.LONGITUDE < -72) & (data.LONGITUDE > -74.5)
cleandf = data[clean_mask]
In [12]:
cleandf.plot(kind='scatter', x='LONGITUDE', y='LATITUDE')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x103b65550>
In [13]:
cleandf.plot(kind='scatter', x='LONGITUDE', y='LATITUDE', figsize=(20, 15))
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a171e2c18>

Addressing Overplotting

In [14]:
# sampling – remember, we can either specify the number of data points, or the percentage of the dataset that 
# we want to keep.

# keep 10,000 data points
sample = cleandf.sample(n=10000)

sample.plot(kind='scatter', x='LONGITUDE', y='LATITUDE', figsize=(20, 15))
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x103c04898>
In [15]:
# keep 1% of the dataset
sample = cleandf.sample(frac=0.01)

sample.plot(kind='scatter', x='LONGITUDE', y='LATITUDE', figsize=(20, 15))
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a186bb2b0>
In [16]:
# altering the marker size:

cleandf.plot(kind='scatter', x='LONGITUDE', y='LATITUDE', figsize=(20, 15), s=0.5 )
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1b7e7358>
In [17]:
# altering the marker transparency:

cleandf.plot(
    kind='scatter',
    x='LONGITUDE',
    y='LATITUDE',
    figsize=(20, 15),
    s=0.5,
    alpha=0.05)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a17bc18d0>

Histograms, Density Plots, and Contour Plots

A hexbin (hexagonal bin) plot creates a 2D histogram in which color encodes the number of points falling within each hexagonal area; the gridsize parameter controls the size of each bin.

In [18]:
cleandf.plot(
    kind='hexbin',
    x='LONGITUDE',
    y='LATITUDE',
    gridsize=100,
    cmap=plt.cm.Blues,
    figsize=(15, 12))
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a17d71ac8>

Density Plots

In [19]:
plt.subplots(figsize=(20, 15))

sample = cleandf.sample(10000) # take a sample, since density plots take a while to compute

sns.kdeplot(
    sample.LONGITUDE,
    sample.LATITUDE,
    gridsize=100,  # controls the resolution
    cmap=plt.cm.rainbow,  # color scheme
    shade=True,  # whether to shade the density (True), or draw just the contours (False)
    alpha=0.5,
    shade_lowest=False,
    n_levels=50  # how many contours/levels to have
)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a18f7f588>

Contour Plots

In [20]:
plt.subplots(figsize=(20, 15))

sample = cleandf.sample(10000)

sns.kdeplot(
    sample.LONGITUDE,
    sample.LATITUDE,
    gridsize=100,
    cmap=plt.cm.rainbow,
    shade=False,
    shade_lowest=False,
    n_levels=25)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1a03eb70>

Combining plots

We can combine multiple plots using the ax parameter (think of an 'ax' as a handle to an individual plot, i.e., a matplotlib axes object).

In [21]:
# imagine we want to combine the scatter plot with the contour plot above...

sample = cleandf.sample(10000)

scatterplot = cleandf.plot(
    kind='scatter',
    x='LONGITUDE',
    y='LATITUDE',
    figsize=(20, 15),
    s=0.5,
    alpha=0.1)

sns.kdeplot(
    sample.LONGITUDE,
    sample.LATITUDE,
    gridsize=100,
    cmap=plt.cm.rainbow,
    shade=False,
    shade_lowest=False,
    n_levels=20,
    alpha=1,
    ax=scatterplot)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a172271d0>

Adding Geographic Boundaries using Bokeh

In [22]:
data = pd.read_csv("./NYPD_Crashes.csv",low_memory=False) # make sure we're reading our raw data
In [23]:
data.dropna(subset=["LATITUDE","LONGITUDE"],inplace=True)

data.head()
Out[23]:
DATE TIME BOROUGH ZIP CODE LATITUDE LONGITUDE LOCATION ON STREET NAME CROSS STREET NAME OFF STREET NAME ... CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1 VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
3 02/06/2013 13:10 NaN NaN 40.874499 -73.927609 POINT (-73.9276086 40.8744985) NaN NaN NaN ... NaN NaN NaN NaN 2874747 SPORT UTILITY / STATION WAGON NaN NaN NaN NaN
8 09/13/2019 21:15 BROOKLYN 11215 40.671160 -73.971420 POINT (-73.97142 40.67116) NaN NaN 18 PROSPECT PARK WEST ... NaN NaN NaN NaN 4206285 Sedan NaN NaN NaN NaN
17 02/05/2013 12:30 NaN NaN 40.697735 -73.814079 POINT (-73.8140794 40.6977347) NaN NaN NaN ... Unspecified NaN NaN NaN 2997767 PASSENGER VEHICLE PASSENGER VEHICLE NaN NaN NaN
18 02/05/2013 18:10 NaN NaN 40.606328 -74.079930 POINT (-74.0799299 40.6063283) NaN NaN NaN ... Unspecified NaN NaN NaN 3122183 PASSENGER VEHICLE PASSENGER VEHICLE NaN NaN NaN
19 02/05/2013 18:40 NaN NaN 40.584775 -73.960108 POINT (-73.9601076 40.5847755) NaN NaN NaN ... NaN NaN NaN NaN 2940781 PASSENGER VEHICLE NaN NaN NaN NaN

5 rows × 29 columns

In [24]:
lat_long = data[["LATITUDE","LONGITUDE","DATE","TIME","BOROUGH","VEHICLE TYPE CODE 1"]]
In [25]:
lat_long.head()
Out[25]:
LATITUDE LONGITUDE DATE TIME BOROUGH VEHICLE TYPE CODE 1
3 40.874499 -73.927609 02/06/2013 13:10 NaN SPORT UTILITY / STATION WAGON
8 40.671160 -73.971420 09/13/2019 21:15 BROOKLYN Sedan
17 40.697735 -73.814079 02/05/2013 12:30 NaN PASSENGER VEHICLE
18 40.606328 -74.079930 02/05/2013 18:10 NaN PASSENGER VEHICLE
19 40.584775 -73.960108 02/05/2013 18:40 NaN PASSENGER VEHICLE
In [26]:
test = lat_long[:100]
In [27]:
test
Out[27]:
LATITUDE LONGITUDE DATE TIME BOROUGH VEHICLE TYPE CODE 1
3 40.874499 -73.927609 02/06/2013 13:10 NaN SPORT UTILITY / STATION WAGON
8 40.671160 -73.971420 09/13/2019 21:15 BROOKLYN Sedan
17 40.697735 -73.814079 02/05/2013 12:30 NaN PASSENGER VEHICLE
18 40.606328 -74.079930 02/05/2013 18:10 NaN PASSENGER VEHICLE
19 40.584775 -73.960108 02/05/2013 18:40 NaN PASSENGER VEHICLE
... ... ... ... ... ... ...
352 40.657020 -74.004731 01/10/2013 7:57 NaN PASSENGER VEHICLE
364 40.771122 -73.869635 01/04/2013 5:30 NaN PASSENGER VEHICLE
368 40.781667 -73.986689 01/03/2013 18:30 NaN PASSENGER VEHICLE
369 40.816883 -73.962650 08/10/2019 3:40 NaN Sedan
370 40.610745 -74.095829 01/03/2013 22:10 NaN PASSENGER VEHICLE

100 rows × 6 columns

In [28]:
lat_list = list(test['LATITUDE'])
lon_list = list(test['LONGITUDE'])

date_list = list(test['DATE'])
time_list = list(test['TIME'])
borough_list = list(test['BOROUGH'])
vehicle_list = list(test['VEHICLE TYPE CODE 1'])
In [29]:
# https://docs.bokeh.org/en/latest/

from bokeh.io import output_file, show
from bokeh.models import (GMapOptions, GMapPlot, Range1d, ColumnDataSource,
                          Circle, PanTool, WheelZoomTool, BoxSelectTool,
                          BoxZoomTool, HoverTool)


map_options = GMapOptions(lat=40.7128, lng=-74.0060, map_type="roadmap", zoom=11)

plot = GMapPlot(x_range=Range1d(), y_range=Range1d(), map_options=map_options, api_key="")  # supply a Google Maps API key here

source = ColumnDataSource(
    data = dict(
        lat=lat_list,
        lon=lon_list,
        date = date_list,
        time = time_list,
        borough = borough_list, 
        vehicle = vehicle_list
    ))

circle = Circle(x="lon", y="lat", size=15, fill_color="blue", fill_alpha=0.8, line_color=None)
plot.add_glyph(source, circle)

plot.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool(), BoxZoomTool())

plot.title.text="NYC Accidents"

plot.add_tools(HoverTool(
    tooltips=[
        ( 'date',   '@date' ),
        ( 'time',  '@time' ), 
        ( 'borough', '@borough' ), 
        ( 'vehicle', '@vehicle' )
    ],

    formatters={
        'date' : 'datetime', # use 'datetime' formatter for 'date' field
        'time' : 'printf',
        'borough' : 'numeral',
        'vehicle' : 'numeral'
    },

    mode='vline'
))

# output_file("gmap_plot.html")

show(plot)

Example: Analyzing Citibike Station Activity using Pandas

In [30]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import pandas as pd
import matplotlib 
import matplotlib.pyplot as plt
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks', 'seaborn-whitegrid'])

First, let's fetch our data as we did in week 1:

In [31]:
import pandas as pd
import sqlite3

con = sqlite3.connect('citibikeDataForViz.db') # connect to our database

Unlike in Week 1, though, we are using a script scheduled via cron (see the crontab entry below) so that our database is continually populated with recent data.

The .py script is called citibike_cron_script.py and can be found in the Class 7 folder of the course repo.

The crontab used is:

*/1 * * * * $(which python3) ~/Desktop/NYU-Projects-in-Programming-Fall-2019/(Class\ 7)\ Data\ Visualization/citibike_cron_script.py >> ~/Desktop/tmp/citiCron.log 2>&1

(where $(which python3) resolves to /Users/siegmanA/anaconda3/bin/python)
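The cron job simply re-runs a small fetch-and-append script every minute. A minimal sketch of what such a script might look like (hedged: the feed URL and payload shape here are assumptions inferred from the column names in our table, not the actual contents of citibike_cron_script.py):

```python
import sqlite3

import pandas as pd


def rows_from_feed(payload):
    """Flatten a station-feed payload (assumed shape: {"stationBeanList": [...]})
    into a DataFrame, one row per station snapshot."""
    return pd.DataFrame(payload["stationBeanList"])


def append_snapshot(payload, db_path="citibikeDataForViz.db"):
    """Append the current snapshot to the StationsData table."""
    frame = rows_from_feed(payload)
    with sqlite3.connect(db_path) as con:
        frame.to_sql("StationsData", con, if_exists="append", index=False)


# In the real script the payload would come from the live feed, e.g.:
# payload = requests.get("https://feeds.citibikenyc.com/stations/stations.json").json()
```

Each invocation appends one snapshot per station, which is why the table grows to nearly a million rows below.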

Now we want to create a query that gets us the average capacity of a given station in hourly intervals.
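Such a query can group on an hour prefix of the timestamp. A hedged sketch against an in-memory stand-in for our database (string truncation is used for the hour because the real table stores 12-hour AM/PM timestamps, which sqlite's date functions do not parse):

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")  # stand-in for citibikeDataForViz.db
con.executescript("""
    CREATE TABLE StationsData (station_id INT, availableBikes INT,
                               totalDocks INT, lastCommunicationTime TEXT);
    INSERT INTO StationsData VALUES
        (72, 10, 40, '2019-10-09 10:05:00'),
        (72, 30, 40, '2019-10-09 10:45:00'),
        (72,  8, 40, '2019-10-09 11:15:00');
""")

# average fraction of docks holding a bike, per station, per hour
hourly = pd.read_sql("""
    SELECT station_id,
           substr(lastCommunicationTime, 1, 13) || ':00' AS hour,
           AVG(CAST(availableBikes AS REAL) / totalDocks) AS avg_capacity
    FROM StationsData
    GROUP BY station_id, hour
""", con=con)
```

(For this toy data, the 10:00 hour averages (10/40 + 30/40)/2 = 0.5.)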

In [32]:
check = pd.read_sql("SELECT * FROM StationsData LIMIT 3", con=con)
check
Out[32]:
station_id stationName availableDocks totalDocks latitude longitude statusValue statusKey availableBikes stAddress1 stAddress2 city postalCode location altitude testStation lastCommunicationTime landMark
0 281 Grand Army Plaza & Central Park S 14 66 40.764397 -73.973715 In Service 1 51 Grand Army Plaza & Central Park S 0 2019-10-09 10:27:28 AM
1 301 E 2 St & Avenue B 28 58 40.722174 -73.983688 In Service 1 28 E 2 St & Avenue B 0 2019-10-09 10:28:00 AM
2 304 Broadway & Battery Pl 10 33 40.704633 -74.013617 In Service 1 23 Broadway & Battery Pl 0 2019-10-09 10:25:26 AM
In [33]:
df = pd.read_sql("""SELECT station_id,
                    stationName,
                    availableBikes, 
                    availableDocks,
                    totalDocks,
                    latitude, 
                    longitude,
                    lastCommunicationTime
                FROM StationsData""", con=con)

df['lastCommunicationTime'] = pd.to_datetime(df['lastCommunicationTime'], format='%Y-%m-%d %H:%M:%S %p') # note: %H is 24-hour, so the trailing AM/PM marker is matched by %p but effectively ignored

df.head()
Out[33]:
station_id stationName availableBikes availableDocks totalDocks latitude longitude lastCommunicationTime
0 281 Grand Army Plaza & Central Park S 51 14 66 40.764397 -73.973715 2019-10-09 10:27:28
1 301 E 2 St & Avenue B 28 28 58 40.722174 -73.983688 2019-10-09 10:28:00
2 304 Broadway & Battery Pl 23 10 33 40.704633 -74.013617 2019-10-09 10:25:26
3 305 E 58 St & 3 Ave 8 24 33 40.760958 -73.967245 2019-10-09 10:25:53
4 337 Old Slip & Front St 9 26 37 40.703799 -74.008387 2019-10-09 10:25:18
In [34]:
len(df)
Out[34]:
993763
In [35]:
df.tail()
Out[35]:
station_id stationName availableBikes availableDocks totalDocks latitude longitude lastCommunicationTime
993758 3855 Frost St & Debevoise Ave 1 21 22 40.71882 -73.93948 2019-10-11 03:06:11
993759 3857 Engert Ave & McGuinness Blvd 0 19 19 40.72158 -73.94546 2019-10-11 03:04:05
993760 3858 Kent St & McGuinness Blvd 11 9 20 40.73124 -73.95161 2019-10-11 03:05:54
993761 3860 Wilson Ave & Troutman St 5 21 26 40.70166 -73.92754 2019-10-11 03:07:19
993762 3861 Menahan St & Wyckoff Ave 1 19 20 40.70113 -73.91422 2019-10-11 03:05:24

Exercise 1:

Create a new column in our df called 'percent_full' that tells us how full a bike station is at a given time

In [36]:
# let's see how 'full' each bike station is at a given time

df['percent_full'] = df['availableBikes']/df['totalDocks']
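One caveat worth noting: a station that reports totalDocks of 0 would make this division return inf, which silently skews any later averaging. A hedged variant on a toy frame:

```python
import numpy as np
import pandas as pd

# toy stand-in for df, including a station that reports zero docks
toy = pd.DataFrame({"availableBikes": [10, 5, 3], "totalDocks": [40, 0, 30]})

# x / 0 yields inf in pandas rather than raising; map those to NaN so that
# aggregations like mean() skip them instead of being inflated
toy["percent_full"] = (toy["availableBikes"] / toy["totalDocks"]).replace(
    [np.inf, -np.inf], np.nan)
```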

Examining Time Series per Station

Let's create a pivot table to examine the time series for individual stations.

In [37]:
station_timeseries = df.pivot_table(
                        index='lastCommunicationTime', 
                        values='percent_full', 
                        aggfunc='mean'
                    ).interpolate(method='pad') 

station_timeseries.head(5)
Out[37]:
percent_full
lastCommunicationTime
1969-12-31 07:00:00 0.0
2019-09-01 01:30:49 0.0
2019-09-12 08:57:47 0.0
2019-09-27 07:50:04 0.0
2019-10-04 10:29:30 0.0

It looks like there's an erroneous entry where we have a last communication time from 1969. Let's get rid of that.

In [38]:
df = df[df.lastCommunicationTime != '1969-12-31 07:00:00']
In [39]:
station_timeseries = df.pivot_table(
                        index='lastCommunicationTime', 
                        values='percent_full', 
                        aggfunc='mean'
                    ).interpolate(method='time') 

station_timeseries.tail()
Out[39]:
percent_full
lastCommunicationTime
2019-10-11 12:59:55 0.362667
2019-10-11 12:59:56 0.244746
2019-10-11 12:59:57 0.356978
2019-10-11 12:59:58 0.400218
2019-10-11 12:59:59 0.663293
In [40]:
# then, let's plot that over time 

%matplotlib inline

station_timeseries.plot(alpha=.5, figsize=(18, 9), ylim=(0,1), xlim=('2019-10-10 06', '2019-10-10 06:30'))

Let's limit our plot to just two stations:

  • Station 3260 at "Mercer St & Bleecker St"
  • Station 161 at "LaGuardia Pl & W 3 St"

which are nearby and tend to exhibit similar behavior. Remember that the list of stations is available as a JSON feed.
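That feed can be flattened into a lookup table the same way we flatten other JSON. A hedged sketch (the payload shape follows the public GBFS convention, and the commented URL may change):

```python
import pandas as pd


def station_table(payload):
    """Extract id/name pairs from a GBFS station_information payload
    (assumed shape: {"data": {"stations": [{"station_id": ..., "name": ...}]}})."""
    stations = pd.DataFrame(payload["data"]["stations"])
    return stations[["station_id", "name"]]


# The live feed would be fetched with e.g.:
# payload = requests.get("https://gbfs.citibikenyc.com/gbfs/en/station_information.json").json()
```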

In [41]:
df[df.stationName.str.contains("Mercer") & df.stationName.str.contains("Bleecker") ].head()
Out[41]:
station_id stationName availableBikes availableDocks totalDocks latitude longitude lastCommunicationTime percent_full
439 3260 Mercer St & Bleecker St 45 0 45 40.727064 -73.996621 2019-10-09 10:24:20 1.0
1304 3260 Mercer St & Bleecker St 45 0 45 40.727064 -73.996621 2019-10-09 11:08:08 1.0
2170 3260 Mercer St & Bleecker St 45 0 45 40.727064 -73.996621 2019-10-09 11:44:26 1.0
3036 3260 Mercer St & Bleecker St 45 0 45 40.727064 -73.996621 2019-10-09 11:48:25 1.0
3902 3260 Mercer St & Bleecker St 45 0 45 40.727064 -73.996621 2019-10-09 11:48:25 1.0
In [42]:
df[df.stationName.str.contains("LaGuardia") ].head()
Out[42]:
station_id stationName availableBikes availableDocks totalDocks latitude longitude lastCommunicationTime percent_full
40 161 LaGuardia Pl & W 3 St 28 6 35 40.72917 -73.998102 2019-10-09 10:26:53 0.800000
905 161 LaGuardia Pl & W 3 St 29 5 35 40.72917 -73.998102 2019-10-09 11:06:55 0.828571
1771 161 LaGuardia Pl & W 3 St 30 4 35 40.72917 -73.998102 2019-10-09 11:48:08 0.857143
2637 161 LaGuardia Pl & W 3 St 30 4 35 40.72917 -73.998102 2019-10-09 11:48:08 0.857143
3503 161 LaGuardia Pl & W 3 St 30 4 35 40.72917 -73.998102 2019-10-09 11:48:08 0.857143
In [43]:
station_timeseries = df.pivot_table(
                        index='lastCommunicationTime', 
                        columns='station_id',
                        values='percent_full', 
                        aggfunc='mean'
                    ).interpolate(method='time') 

station_timeseries.tail()
Out[43]:
station_id 72 79 82 83 116 119 120 127 128 143 ... 3850 3851 3852 3853 3854 3855 3857 3858 3860 3861
lastCommunicationTime
2019-10-11 12:59:55 0.945455 0.666667 0.111111 0.919355 0.871795 0.842105 0.105263 0.064516 0.866667 0.666667 ... 1.0 0.1 0.294118 0.25 0.0 0.0 0.0 0.55 0.0 0.0
2019-10-11 12:59:56 0.945455 0.666667 0.111111 0.919355 0.871795 0.842105 0.105263 0.064516 0.866667 0.666667 ... 1.0 0.1 0.294118 0.25 0.0 0.0 0.0 0.55 0.0 0.0
2019-10-11 12:59:57 0.945455 0.666667 0.111111 0.919355 0.871795 0.842105 0.105263 0.064516 0.866667 0.666667 ... 1.0 0.1 0.294118 0.25 0.0 0.0 0.0 0.55 0.0 0.0
2019-10-11 12:59:58 0.945455 0.666667 0.111111 0.919355 0.871795 0.842105 0.105263 0.064516 0.866667 0.666667 ... 1.0 0.1 0.294118 0.25 0.0 0.0 0.0 0.55 0.0 0.0
2019-10-11 12:59:59 0.945455 0.666667 0.111111 0.919355 0.871795 0.842105 0.105263 0.064516 0.866667 0.666667 ... 1.0 0.1 0.294118 0.25 0.0 0.0 0.0 0.55 0.0 0.0

5 rows × 860 columns


Exercise 2

Plot a timeseries graph for stations 3260 and 161 only

In [44]:
station_timeseries[ [161, 3260] ].plot(
    alpha=0.5,  
    legend=False, 
    figsize=(20,5), 
    xlim=('2019-10-10 06', '2019-10-10 06:30'),
    ylim=(0,1)
)
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a62ac6160>

Finding Bike Stations with Similar Behavior

For our next analysis, we are going to try to find bike stations that have similar behaviors over time. A very simple technique that we can use to find similar time series is to treat the time series as vectors, and compute their correlation. Pandas provides the corr function that can be used to calculate the correlation of columns. (If we want to compute the correlation of rows, we can just take the transpose of the dataframe using the transpose() function, and compute the correlations there.)
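To make the columns-vs-rows point concrete, here is a small self-contained example (toy data, not our station table):

```python
import pandas as pd

# three short "time series" as columns: a and b move together, c moves opposite
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})

col_corr = frame.corr(method="pearson")  # correlation between columns
row_corr = frame.transpose().corr()      # correlation between the original rows
```

Here col_corr.loc['a', 'b'] is 1.0 and col_corr.loc['a', 'c'] is -1.0, while row_corr is a 4×4 matrix with one entry per pair of rows.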

In [45]:
similarities = station_timeseries.corr(method='pearson')
similarities.head(5)
Out[45]:
station_id 72 79 82 83 116 119 120 127 128 143 ... 3850 3851 3852 3853 3854 3855 3857 3858 3860 3861
station_id
72 1.000000 0.076472 -0.556920 -0.457938 0.095601 0.396191 0.242809 0.246403 0.194088 -0.390355 ... 0.161908 0.470888 0.390598 0.435296 -0.377070 -0.050026 0.003233 0.363587 0.288690 0.311658
79 0.076472 1.000000 -0.496689 -0.193591 0.062969 -0.132101 -0.002484 0.216853 -0.161228 -0.121802 ... -0.257629 -0.284970 0.068948 0.129691 -0.084766 -0.268215 -0.581908 -0.227574 0.199033 -0.342244
82 -0.556920 -0.496689 1.000000 0.608826 -0.126529 -0.204969 -0.282012 -0.564416 -0.183787 0.597209 ... 0.368346 -0.388465 -0.325867 -0.259176 0.279647 -0.083897 0.217835 -0.031311 -0.304655 0.103367
83 -0.457938 -0.193591 0.608826 1.000000 -0.069435 0.157530 -0.410199 -0.636583 -0.483544 0.438878 ... 0.495132 -0.167974 0.145556 -0.148183 -0.201039 -0.181565 -0.141873 0.256342 -0.412770 -0.520408
116 0.095601 0.062969 -0.126529 -0.069435 1.000000 -0.439817 -0.130314 -0.527030 0.357481 0.419083 ... 0.152847 0.517432 -0.123523 0.674887 0.348977 0.144448 -0.476454 -0.345427 0.143049 -0.112707

5 rows × 860 columns

Let's see the similarities of the two stations that we examined above.

In [46]:
stations = [161, 3260]

similarities[stations].loc[stations]
Out[46]:
station_id 161 3260
station_id
161 1.000000 0.309542
3260 0.309542 1.000000
In [47]:
# 393: E 5 St & Avenue C
# 2003: 1 Ave & E 18 St

stations = [393, 2003] 
    
similarities[stations].loc[stations]
Out[47]:
station_id 393 2003
station_id
393 1.000000 0.815269
2003 0.815269 1.000000

For bookkeeping purposes, we are going to drop stations that generate NaN values, as we cannot use such entries for our analysis.

In [48]:
# Number of stations with non-NaN similarity per station
check = similarities.count()
# Find the number of stations with less than the max number of similarities
todrop = check[check < check.max()].index.values
similarities.drop(todrop, axis='index', inplace=True)
similarities.drop(todrop, axis='columns', inplace=True)

Clustering Based on Distances

Without explaining too much about clustering, we are going to use a clustering technique to group together bike stations that are "nearby" according to our similarity analysis. For this, we first need to convert our similarities into distances. Our distance values will always be positive and bounded between 0 and 1:

  • If two stations have correlation 1, they behave identically, and therefore have distance 0
  • If two stations have correlation -1, they have exactly opposite behaviors, and therefore should have distance 1 (the max)
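The mapping used in the next cell, (0.5 * (1 - corr)) ** 2, can be sanity-checked at these endpoints:

```python
def corr_to_distance(corr):
    """Map a correlation in [-1, 1] to a distance in [0, 1]."""
    return (0.5 * (1 - corr)) ** 2

assert corr_to_distance(1) == 0       # identical behavior -> distance 0
assert corr_to_distance(-1) == 1      # opposite behavior -> distance 1 (the max)
assert corr_to_distance(0) == 0.25    # uncorrelated stations land in between
```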
In [49]:
# similarity ranges from -1 to 1, so 1 - similarity ranges from 0 to 2;
# we multiply by 0.5 to map it into [0, 1], and then square it

distances = ((.5*(1-similarities))**2)
distances.head(5)
Out[49]:
station_id 72 79 82 83 116 119 120 127 128 143 ... 3848 3850 3851 3852 3853 3855 3857 3858 3860 3861
station_id
72 0.000000 0.213226 0.606000 0.531396 0.204484 0.091146 0.143335 0.141977 0.162373 0.483272 ... 0.232791 0.175600 0.069990 0.092843 0.079723 0.275639 0.248386 0.101255 0.126491 0.118454
79 0.213226 0.000000 0.560019 0.356165 0.219507 0.320413 0.251244 0.153330 0.337113 0.314610 ... 0.401100 0.395407 0.412787 0.216715 0.189359 0.402092 0.625608 0.376734 0.160387 0.450405
82 0.606000 0.560019 0.000000 0.038254 0.317267 0.362987 0.410889 0.611850 0.350338 0.040560 ... 0.306558 0.099747 0.481959 0.439481 0.396381 0.293708 0.152946 0.265900 0.425531 0.200988
83 0.531396 0.356165 0.038254 0.000000 0.285923 0.177439 0.497166 0.669601 0.550226 0.078714 ... 0.178303 0.063723 0.341041 0.182519 0.329581 0.349024 0.325968 0.138257 0.498980 0.577910
116 0.204484 0.219507 0.317267 0.285923 0.000000 0.518269 0.319402 0.582955 0.103208 0.084366 ... 0.078323 0.179417 0.058218 0.315576 0.026425 0.182992 0.544979 0.452544 0.183591 0.309529

5 rows × 849 columns

The clustering code is very simple: the cell below partitions the stations into two groups.

In [50]:
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=2)
cluster.fit(distances.values)
Out[50]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

We will now take the results of the clustering and assign each station to its cluster.

In [51]:
labels = pd.DataFrame(list(zip(distances.index.values.tolist(), cluster.labels_)), columns = ["station_id", "cluster"])
labels
Out[51]:
station_id cluster
0 72 1
1 79 1
2 82 0
3 83 0
4 116 0
... ... ...
844 3855 1
845 3857 1
846 3858 0
847 3860 0
848 3861 1

849 rows × 2 columns

Let's see how many stations are in each cluster:

In [52]:
labels.pivot_table(
    index = 'cluster',
    aggfunc = 'count'
)
Out[52]:
station_id
cluster
0 340
1 509

Visualizing the Time Series Clusters

We will start by assigning a color to each cluster, so that we can plot each station's timeline in its cluster's color. (We define a long list of colors so that we can change the number of clusters in the earlier code and still get nicely colored results.)

In [53]:
colors = list(['red','black', 'green', 'magenta', 'yellow', 'blue', 'white', 'cyan'])
labels['color'] = labels['cluster'].apply(lambda cluster_id : colors[cluster_id]) 
labels.head(10)
Out[53]:
station_id cluster color
0 72 1 black
1 79 1 black
2 82 0 red
3 83 0 red
4 116 0 red
5 119 1 black
6 120 1 black
7 127 1 black
8 128 1 black
9 143 0 red
In [54]:
stations_plot = station_timeseries.plot(
    alpha=0.5, 
    legend=False, 
    figsize=(20,5), 
    linewidth=1,
    color=labels['color'],
    xlim=('2019-10-10 06', '2019-10-10 06:30'),
    ylim=(0,1)
)

The plot still looks messy. Let's instead plot a single line for each cluster. To represent a cluster, we use the median fullness value across all of its stations at each timestamp. For that, we can again use a pivot table: we define lastCommunicationTime as one dimension of the table and cluster as the other, and we use the median function.

For that, we first join our original dataframe with the results of the clustering using the merge command, adding an extra column that holds the cluster id for each station. Then we compute the pivot table.

In [55]:
median_cluster = df.merge(
    labels,
    how='inner',
    on='station_id'
).pivot_table(
    index='lastCommunicationTime', 
    columns='cluster', 
    values='percent_full', 
    aggfunc='median'
)

median_cluster.head(15)
Out[55]:
cluster 0 1
lastCommunicationTime
2019-10-09 01:00:00 0.689899 0.428571
2019-10-09 01:00:01 0.563636 0.173913
2019-10-09 01:00:02 0.761111 0.173913
2019-10-09 01:00:03 0.840000 0.523810
2019-10-09 01:00:04 0.433333 0.238095
2019-10-09 01:00:05 0.944444 0.364802
2019-10-09 01:00:06 0.166667 0.222222
2019-10-09 01:00:07 0.533333 0.318182
2019-10-09 01:00:08 0.666667 0.606061
2019-10-09 01:00:09 0.541667 NaN
2019-10-09 01:00:10 NaN 0.851852
2019-10-09 01:00:11 0.717949 NaN
2019-10-09 01:00:12 NaN 0.000000
2019-10-09 01:00:13 NaN 0.074074
2019-10-09 01:00:14 NaN 0.618261

Now, we can plot the medians for the two clusters.

In [58]:
median_cluster.plot(
    figsize=(20,5), 
    linewidth = 2, 
    alpha = 0.75,
    color=colors,
    ylim = (0,1),
    xlim=('2019-10-10 06', '2019-10-10 06:05'),
    grid = True
)
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a40c6fb00>

And just for fun and for visual decoration, let's put the two plots together. We fade the individual station time series heavily (alpha=0.005) and make the median lines more prominent by increasing their linewidth. We also limit the x-axis to a short time window:

In [59]:
stations_plot = station_timeseries.plot(
    alpha=0.005, 
    legend=False, 
    figsize=(20,5), 
    color=labels["color"]
)

median_cluster.plot(
    figsize=(20,5), 
    linewidth = 3, 
    alpha = 0.5,
    color=colors, 
    xlim=('2019-10-10 06', '2019-10-10 06:05'),
    ylim=(0,1),
    ax = stations_plot
)
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a42816be0>